Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Represent pandas ordered categoricals as ordinal data #2522

Merged
merged 7 commits into from
Jan 25, 2023

Conversation

joelostblom
Copy link
Contributor

A suggestion for how Altair could support ordered pandas categoricals. This doesn't deal with the order of segments in a stacked bar chart, but I think that is something that should be fixed in VL (vega/vega-lite#1734) rather than here.

close #245, ref #2170

@@ -193,8 +193,6 @@ def infer_vegalite_type(data):
# Otherwise, infer based on the dtype of the input
typ = infer_dtype(data)

# TODO: Once this returns 'O', please update test_select_x and test_select_y in test_api.py
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this refers to ordinal data, but I could not find these specific tests, but I updated the one test that seemed relevant.

.encode(
alt.X("x", type="nominal"),
alt.Y("y", type="ordinal"),
alt.Size("s", type="nominal", sort=None),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little annoying that the sort categories has to be set to None manually when changing the data type, but if the shorthand is used 's:O' then this happens automatically (and the shorthands are probably the most commonly used syntax for this).

Copy link
Contributor

@binste binste Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the PR and it looks good to me! Only thing where it could be nice to dig some more is this part. I think this could be solved by a custom logic in https://github.com/altair-viz/altair/blob/master/tools/schemapi/schemapi.py#L365. Pseudocode (sorry didn't have time to test this more):

if "sort" in parsed_shorthand and kwds.get("type") != "ordinal":
    parsed_shorthand.pop("sort")

Given that "sort" is only added in parsed_shorthand if an ordered pandas categorical was used, it seems safe to remove it here if the type is specified already.

Not sure if there are other places which would need to be modified. Btw, this code is only in schemapi.py after you merged in the latest changes from master due to #2813

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @binste ! I had to add Undefined to your suggested solution, otherwise it would incorrectly remove the sorting for categorical data where the type was not manually specified, does that makes sense to you? I rebased on master and pushed the latest changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that makes sense!

@joelostblom
Copy link
Contributor Author

Sorry I noticed that I forgot to click "Request review" here initially.

@nkrishnaswami
Copy link

Would love to see support for pandas ordinals! @jakevdp, ptal?

@mattijn
Copy link
Contributor

mattijn commented Dec 27, 2022

I noticed that the changes to the tests are aiming vegalite/v4, you probably want to change this to vegalite/v5 if you think this PR should be merged @joelostblom.

@joelostblom
Copy link
Contributor Author

@mattijn I moved the tests to v5 instead of v4 and also updated the docs. Feel free to give this a review if you have the time.

Copy link
Contributor

@binste binste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment as a reply to yours.

@mattijn
Copy link
Contributor

mattijn commented Jan 24, 2023

I looked to the code changes and looked to the related issues PRs, but I still feel I miss a bit of context. Would it be possible to include a comment with a very simple example of the behavior before this PR and the behavior incorporating the changes of this PR?

@binste
Copy link
Contributor

binste commented Jan 25, 2023

Example

Based on this comment.

Data and imports:

import altair as alt
import pandas as pd
from vega_datasets import data

source = data.barley()

If a field is passed in as an ordered pd.Categorical such as

site_lst = ['Crookston', 'Morris', 'University Farm','Duluth', 'Grand Rapids', 'Waseca']
source.site = pd.Categorical(source.site, site_lst, ordered = True)

Altair so far ignored this ordering and just treated it as a nominal field as if it would be a column of strings.

This means that in the current master branch, the colors are not sorted according to site_lst and the type of site is inferred as nominal:

alt.Chart(source).mark_bar().encode(
    x='sum(yield)',
    y='variety',
    color="site",
)

image

Running the same code again on the branch of this PR, Altair does two things differently:

  • it will choose ordinal as the type for site
  • and it will set the sort argument to the order defined in site_lst

image

So this PR basically expands color='site to color=alt.Color('site:O', sort=alt.Sort(site_lst)) (note the :O as well as sort).

To sort the stacked bar charts as well, one still needs to use the trick documented by @mattijn here. Note that this is maybe not a good example to show the use of ordinal data as the data is probably best represented as nominal to get a qualitative color scheme and with a sort attribute on Color to get the desired sorting. But I hope this helps in reviewing.

@binste
Copy link
Contributor

binste commented Jan 25, 2023

A better example is probably the one in this comment.

import altair as alt
import pandas as pd

dfdict = {
    "Questions": {
        0: "Question Text",
        1: "Question Text",
        2: "Question Text",
        3: "Question Text",
        4: "Question Text",
        5: "Question Text",
    },
    "level": {
        0: "1 - Strongly Disagree",
        1: "2 - Disagree",
        2: "3 - Neutral",
        3: "4 - Agree",
        4: "5 - Strongly Agree",
        5: "N/A",
    },
    "value": {0: 1.4, 1: 5.7, 2: 10.0, 3: 32.9, 4: 47.1, 5: 2.9},
}

df = pd.DataFrame(dfdict)

sort_order = [
    "N/A",
    "5 - Strongly Agree",
    "4 - Agree",
    "3 - Neutral",
    "2 - Disagree",
    "1 - Strongly Disagree",
]

# This doesn't seem to be needed
df["level"] = pd.Categorical(df["level"], sort_order, ordered=True)

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("value"),
        y=alt.Y("Questions", title=""),
        color="level", 
        order=alt.Order("color_site_sort_index:Q"),
    )
)

chart

Behaviour with this PR:
image

You can see that color is represented as ordinal and with the colors sorted according to sort_order.

Copy link
Contributor

@mattijn mattijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand the PR. I think it is useful, if you prepare your pandas dataframe nicely using ordered categoricals it is nice if that would be respected by Altair.

Although it would be a first that the sort encoding channel option would be set based on the type of a pandas data frame column.

I found one more related comment to this suggested change (#899 (comment)):

Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects.

The unintended side-effects.. what can it be? Eg. if I have a column with ordered categorical values. What will happen if I define it as type :N. Would Altair in that case still respect the sorting?

altair/utils/core.py Outdated Show resolved Hide resolved
@binste
Copy link
Contributor

binste commented Jan 25, 2023

If a different data type is specified such as :N then sort is not set and the behaviour is the same as on the master branch. This happens here https://github.com/altair-viz/altair/pull/2522/files#diff-25906c2570b127f0bb9cff4550bcc153006b2fdb36aea55ae804ca9a74eb4404R367. Just tested and verified this with an example.

I guess it can be surprising to people who are used to that only the datatypes are inferred. But converting a column in pandas to an ordered categorical one is usually a rather conscious decision so I'd expect them to either be happy about the choice that Vega-Lite then makes in terms of color scale and ordering or that they at least figure out how to fix it. Certainly a feature which would be great to have in the release candidate.

@mattijn mattijn merged commit 97ff1eb into vega:master Jan 25, 2023
@mattijn
Copy link
Contributor

mattijn commented Jan 25, 2023

I agree. Since the introduced behaviour is only targeting pandas data frames with ordered categorical columns the unintended consequences are, as far I can see, limited.
But the introduced behaviour of this PR will probably appreciated if respected by Altair by the limited user-base who make use of ordered categorical columns. All tests pass, so its merged.
Thanks @joelostblom for this PR, thanks @binste for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support of ordinal based on pandas' ordered Categorical type?
4 participants